Structural Equivalence Between Co-occurrences of Characters and Words in the Chinese Language
Abstract
Complex networks are constructed to study the co-occurrence of characters and words in the Chinese language. Two types of networks are investigated: in the first, nodes correspond to Chinese characters; in the second, nodes correspond to Chinese words. In both, edges connect characters or words that occur consecutively. The networks are built from a collection of Chinese texts in four styles: essays, novels, popular science articles, and news reports. Their statistical properties are studied in terms of complex network parameters, including average degree, diameter, average path length, clustering coefficient, degree distribution, and connected subnetworks. Although the two kinds of networks have different parameter values, they display qualitatively similar properties, such as small-world and scale-free features. This qualitative equivalence between the network of Chinese characters and the network of Chinese words provides a valid basis on which either type of network can be used for comparing different languages, regardless of the incompatibility between the linguistic roles that words play in Chinese and in other languages.
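As a rough illustration of the construction described in the abstract, the following minimal sketch builds an undirected co-occurrence network from a sequence of units (characters, or words from an already segmented text) and computes a few of the listed parameters. The function names and the toy input are hypothetical and not taken from the paper; treating edges as undirected and unweighted, ignoring self-loops, and averaging path lengths over reachable pairs only are all assumptions of this sketch.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence_network(units):
    """Undirected graph linking units that occur consecutively (assumed convention)."""
    adj = {}
    for a, b in zip(units, units[1:]):
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:  # skip self-loops produced by a repeated unit
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_degree(adj):
    """Mean number of distinct neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_stats(adj):
    """Average shortest path length and diameter over reachable node pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter


# Toy character-level example; a word-level network would pass a list of
# segmented words instead (segmentation itself is outside this sketch).
chars = list("网络语言网络")
net = build_cooccurrence_network(chars)
avg_len, diam = path_stats(net)
print(average_degree(net), clustering_coefficient(net), avg_len, diam)
```

The same functions apply unchanged to both network types, since only the choice of unit differs. The all-pairs breadth-first search is quadratic in the number of nodes, so a corpus-scale network would likely need a graph library or sampling.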
Similar resources
A New Document Embedding Method for News Classification
Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Hybrid Models for Chinese Unknown Word Resolution Dissertation
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...
Using $k$-way Co-occurrences for Learning Word Embeddings
Co-occurrences between two words provide useful insights into the semantics of those words. Consequently, numerous prior works on word embedding learning have used co-occurrences between two words as the training signal for learning word embeddings. However, in natural language texts it is common for multiple words to be related and co-occurring in the same context. We extend the notion of co-oc...
“Those Nation Wreckers are Suffering from Inferiority Complex”: The Depiction of Chinese Miners in the Ghanaian Press
This article studies the depiction of Chinese miners in the Ghanaian news website entitled Modern Ghana. A total of 87 articles comprising 43752 words were retrieved. Van Leeuwen’s (2008) theory of the representation of the social actors was utilised to examine the depiction of Chinese miners in the Ghanaian press. In this regard, six applicable tools were used and these include exclusion, role...
Variations of the Morse-Hedlund Theorem for k-Abelian Equivalence
In this paper we investigate local-to-global phenomena for a new family of complexity functions of infinite words indexed by k ∈ N1∪{+∞} where N1 denotes the set of positive integers. Two finite words u and v in A∗ are said to be k-abelian equivalent if for all x ∈ A∗ of length less than or equal to k, the number of occurrences of x in u is equal to the number of occurrences of x in v. This def...